The Download: AI benchmarks, and Spain's grid blackout
SWE-Bench (pronounced "swee bench") launched in November 2024 as a way to evaluate an AI model's coding skill, and it has quickly become one of the most popular tests in AI. A SWE-Bench score is now a mainstay of major model releases from OpenAI, Anthropic, and Google, and beyond the foundation models, fine-tuners at AI firms are in constant competition to see who can rise above the pack. Despite all the fervor, a score isn't exactly a truthful assessment of which model is "better." Entrants have begun to game the system, which is pushing many others to wonder whether there's a better way to actually measure AI achievement.